{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The penguins datasets\n",
    "\n",
    "In this notebook, we give a quick presentation of the [Palmer penguins\n",
    "dataset](https://allisonhorst.github.io/palmerpenguins/). We use this dataset\n",
    "for both classification and regression problems, selecting a subset of the\n",
    "features to keep our explanations intuitive.\n",
    "\n",
    "## Classification dataset\n",
    "\n",
    "We use this dataset in a classification setting to predict the penguins'\n",
    "species from anatomical information.\n",
    "\n",
    "Each penguin belongs to one of the following three species: Adelie, Gentoo,\n",
    "and Chinstrap. See the illustration below depicting the three different\n",
    "penguin species:\n",
    "\n",
    "![Image of\n",
    "penguins](https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/lter_penguins.png)\n",
    "\n",
    "This is a classification problem since the target is categorical. We limit\n",
    "our input data to a subset of the original features to simplify our\n",
    "explanations when presenting the decision tree algorithm. Specifically, we use\n",
    "two features based on the penguins' culmen measurements: its length and its\n",
    "depth. You can learn more about the penguins' culmen with the illustration\n",
    "below:\n",
    "\n",
    "![Image of\n",
    "culmen](https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/culmen_depth.png)\n",
    "\n",
    "We start by loading this subset of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
    "\n",
    "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
    "target_column = \"Species\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the dataset in more detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "penguins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we have few samples, we can use a scatter plot to observe how the\n",
    "samples are distributed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "pairplot_figure = sns.pairplot(penguins, hue=\"Species\")\n",
    "pairplot_figure.fig.set_size_inches(9, 6.5)"
   ]
  },
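  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a numerical complement to the pairplot, the cell below is a minimal sketch\n",
    "that summarizes each culmen feature per species. It assumes that the\n",
    "`penguins`, `culmen_columns` and `target_column` variables defined in the\n",
    "cells above are still in memory; the name `species_summary` is only introduced\n",
    "here for illustration. A feature should separate two species well when their\n",
    "means differ by much more than their standard deviations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mean and standard deviation of each culmen feature within each species.\n",
    "# Relies on `penguins`, `culmen_columns` and `target_column` defined above.\n",
    "species_summary = penguins.groupby(target_column)[culmen_columns].agg(\n",
    "    [\"mean\", \"std\"]\n",
    ")\n",
    "species_summary"
   ]
  },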
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's look at the feature distributions shown on the diagonal of the\n",
    "pairplot. We can draw the following intuitions:\n",
    "\n",
    "* The Adelie species can be differentiated from the Gentoo and Chinstrap\n",
    "  species based on the culmen length;\n",
    "* The Gentoo species can be differentiated from the Adelie and Chinstrap\n",
    "  species based on the culmen depth.\n",
    "\n",
    "## Regression dataset\n",
    "\n",
    "In a regression setting, the target is a continuous variable rather than a\n",
    "category. Here, we use two columns of the dataset to build such a problem:\n",
    "the flipper length is the input feature and the body mass is the target. In\n",
    "short, we want to predict the body mass using the flipper length.\n",
    "\n",
    "We load the dataset and visualize the relationship between the flipper length\n",
    "and the body mass of penguins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "penguins = pd.read_csv(\"../datasets/penguins_regression.csv\")\n",
    "\n",
    "feature_name = \"Flipper Length (mm)\"\n",
    "target_column = \"Body Mass (g)\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sns.scatterplot(data=penguins, x=feature_name, y=target_column)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we deal with a regression problem because our target is a continuous\n",
    "variable ranging from 2.7 kg to 6.3 kg (2,700 g to 6,300 g in the units of the\n",
    "target column). From the scatter plot above, we observe an approximately\n",
    "linear relationship between the flipper length and the body mass: the longer\n",
    "the flipper of a penguin, the heavier the penguin tends to be."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}